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Abstract 

We describe the development of a new toolkit for data analysis. The analysis 
package is based on Bayes' Theorem, and is realized with the use of Markov 
Chain Monte Carlo. This gives access to the full posterior probability dis- 
tribution. Parameter estimation, limit setting and uncertainty propagation 
are implemented in a straightforward manner. A goodness-of-fit criterion is 
presented which is intuitive and of great practical use. 
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1. Introduction 

The goal of data analysis is to compare model predictions with data, and 
draw conclusions either on the validity of the model as a representation of 
the data, or on the possible values for parameters within the context of a 
model. The paradigm for our data analysis package is shown in Fig. [TJ Here, 
the model M can range from a model purporting to represent nature (e.g., 
the Standard Model in particle physics) to a simple parametrization of data 
useful for making predictions or for summarizing data. 

1.1. Modeling 

The theory or model can be used to provide direct probabilities; i.e., rel- 
ative frequencies of possible outcomes of the results were one to reproduce 
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Figure 1: Paradigm for data analysis. Knowledge is gained from a comparison of model 
predictions with data. Intermediate steps may be necessary, e.g., to model experimental 
conditions. 



the experiment many times under identical conditions. This is possible be- 
cause the model is a mathematical construction which allows the calculation 
(or simulation) of frequencies of outcomes. The predictions from the model 
cannot usually be directly compared to experimental results. An additional 
step is needed, either to modify the predictions to allow for the experimental 
effects, or to undo the experimental effects from the data. Obviously, an ac- 
curate description of the experimental effects is necessary to produce reliable 
conclusions. 

The function g(y\X,M) gives the relative frequency of getting result y 
assuming the model M and parameters A. It should satisfy: 

g{y\X,M)>0 (1) 

and 

Y,g{Vi\\M)=l or f g(y\\,M)dy = l (2) 

i J 

depending on whether discrete or continuous values are measured. In the 
following, we will write formulae for the continuous case; the modification 
for the discrete case will be clear. Note that the normalization requirement 
is often discarded when only relative probabilities of outcomes are needed. 
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The modeling of the experiment will usually add extra parameters and 
assumptions. We will use the symbol v to represent these additional (nui- 
sance) parameters. There could also be additional information not included 
explicitly in the model which could limit values of the model parameters. 
This information is often denoted with an additional symbol, e.g., J, and we 
would then have f(x\\,V,M,I) for the frequency distribution of observable 
quantities x. We will drop the I in the following - it is assumed everywhere 
that all available information is used in the probability distributions. 

As an example, one can consider the radioactive decay of an unstable nu- 
cleus. The model is an exponential decay time distribution, P(t\r) = \e^ t l T , 
with the lifetime parameter r. Then 

9{{U}\r) = \{P{U\r) = \\\e-^ . 

i i 

We can imagine that we are measuring a sample of nuclei, and that our 
detector is not able to distinguish two decays which occur too closely in 
time. Also, the decay times will be measured with some precision which 
could be modeled with a Gaussian smearing with width a t . This measure- 
ment resolution may not be known precisely, and could be considered a nui- 
sance parameter. The probability density for a given set of measured times 
/({£™ easured }|r, at) could then be determined from a simulation which includes 
the dead time and measurement resolution. 

In general, the judgment on the validity of a model and the extraction of 
values of the parameters within the model will be based on a comparison of 
the data, D, with f(x\\, v, M). 

1.2. Formulation of the Learning Rule 

The probability of a model, M, will be quantified as P(M), with 

< P(M) < 1 (3) 

while the probability density of the parameters are typically continuous func- 
tions. The parameters from the modeling of the experimental conditions are 
not correlated to the parameters of the physical model so that 

P(X,u\M) = P{\\M)P{v) . (4) 
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The probability densities satisfy 



P(\\M) > (5) 
P(X\M)dX = 1 (6) 

(7) 

and similarly 

P(P) > (8) 
P{y)dv = 1 . (9) 



In the Bayesian approach, the quantities P(M) and P(X\M) are treated 
as probabilities (probability densities), although they are not in any sense fre- 
quency distributions and are more accurately described as degrees- of-belief 
This degree- of-belief is updated by comparing data with the predictions of 
the models. The parameter and model degrees- of-belief are the interesting 
quantities since they contain our knowledge about nature. The purpose of 
performing experiments is then to modify our degree- of-belief. P(M) = 1 
represents complete certainty of M being correct, and P(M) = represents 
complete certainty that M is false. 

The procedure for learning from experiment is: 

P i+1 (X, v, M\D) oc f(x = D\X, v, M)Pi(X, P, M) , (10) 

where the index on P represents a state- of -knowledge. 
In order to satisfy our normalization requirement 



J P(X,u,M)dXdu = J2 P ( M ) J P(MM)dX J P{V)dV 



M " M 

we have 



P i+1 (UM\D) = , . (12) 

J2 M ff(x = D\X,v,M)P i (\,v,M)dXdu 

We usually just write Pi = P , and this quantity is called the prior. It con- 
tains all information we may have on the model and parameter values before 
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the experiment is performed. The posterior probability density function, 
Pj+i, is usually written simply as P. It describes the state of knowledge after 
the experiment is analyzed. 

For a given model, M, f is a function of the model parameters, the 
experimental parameters, and the possible outcomes x. When / is viewed 
as a function of (A, v) for fixed x — D, it is known as the likelihood. In our 
formulation, / is the relative frequency of a particular result x = D from our 
modeling. If / is normalized, we can write 

P(D\X,u,M) = f(x = D\\,v,M) . (13) 

The denominator in Eq. ffT2l) is the probability to get the data, D, assum- 
ing the models M and the description of the experimental conditions describe 
all possible outcomes, and can be written as P(D). Using this notation, we 
then recover the classic equation due to Bayes: 

P{Um = Pmp '^ P{UM) - (14) 

P(D) 

The scheme for updating knowledge, as written down here, is general and 
straightforward. There can be difficulties in implementation when dealing 
with mult i- dimensional spaces, and the toolkit presented here is designed to 
help the user solve these technical issues. It is up to the user to carefully 
define the function / as well as the priors. The precision and accuracy of the 
results will depend primarily on the quality of the inputs. Given that these 
functions are well-defined, the BAT program will then be useful as a tool for 
model testing and parameter estimation. 

In the following, we start with a discussion on how to perform param- 
eter estimation with the model fixed. This is followed with a discussion of 
goodness-of-fit criteria, and the closely related topic of model comparison. 
Discovery criteria are discussed in this context. Setting limits on parameters 
while keeping the model fixed is briefly discussed. We then switch to the 
practical realization of this analysis framework in terms of Markov Chains. 
A brief review of Markov Chains is given, followed by a discussion of the im- 
plementation in BAT. We then give examples of parameter estimation and 
model testing to make the ideas more concrete. 
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2. Parameter Estimation 

Parameter estimation is performed while keeping the model fixed. In this 
case, we write 

P(A, m M) = ^-^MWX, g |M)_ 

fP(x = D\X,P,M)P {X, V\M) dXdv 

The output of the evaluation is a normalized probability density for the 
parameters, including all correlations. This output can be used to give best- 
fit values, probability intervals for the parameters, etc. 

The function, P(x = D\X,P,M), which gives the probability density at 
x = D given the model M and the parameters (A, P) must be defined by the 
user, and must return reasonable values for all possible data results. This 
likelihood function contains the information both from the theory input as 
well as the modeling of the experimental conditions. It must be defined case- 
by-case^. Similarly, the prior probabilities, P (X,P\M), must be defined by 
the user. These prior probability functions define the range over which the 
parameters can vary, and any preconceptions concerning their possible val- 
ues. Note that if these priors are set to a constant (not always applicable!), 
then the parameter estimation using the mode of the posterior probability 
density function (pdf) is equivalent to a maximum likelihood estimation. If 
in addition the fluctuations of the data from the model predictions are as- 
sumed to follow Gaussian distributions, then the mode finding reduces to a 
X 2 minimization. The formalism therefore contains these commonly used fit- 
ting approaches. It is however completely general and is not limited by these 
conditions. In all cases, the full posterior pdf is available, thus allowing the 
user to study correlations between parameters, to perform the propagation 
of uncertainties without approximations, and to define uncertainty bands or 
limits as desired. The usefulness of this information is demonstrated in the 
examples section. 

When working with the posterior pdf, P(X, P\D,M), it is often the case 
that one is interested not in the full pdf but in the probability distribution 
for only one, or a few, parameters. These distributions are determined via 



Some applications, such as histogram fitting using products of Poisson distributions 
or curve fitting using products of Gaussian distributions can be performed in BAT with 
minimal effort. 
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marginalization. For example, the probability distribution for parameter Aj 
is: 



P(\i\D, M) — I P(X, u\D, M) dX^i dv . 



(16) 



Note that the parameter values which maximize the full posterior pdf usually 
do not coincide with the values which maximize the marginalized distribu- 
tions. 



2.1. Use of the Posterior Probability Distribution 

The posterior pdf, P(\,u\D, M), can be used to evaluate any desired 
quantity related to the parameters. For example, commonly used quantities 
are: 



< A* >= / P(X i \D,M)X i dX i . 



Mean of A i . 



Median of Aj . The value of Aj such that 50% of the probabilty is below this 
value 



/ 

" ^min 



P(Xi\D,M) dXi = 0.5, 



where A m i n is the minimum possible value for parameter Aj. The desired 
value is A me d- 

Mode o/Aj. The value of Aj which maximizes the marginalized posterior pdf 



argmax 



P(Xi\D,M) 



Note that the mode can be evaluated for any number of parameters with the 
rest marginalized. The most common uses are the mode of the full pdf and 
modes evaluated for one parameter at a time. 

rms. The root-mean-square is defined as usual 



rmsj 



P(X i \D,M)X 2 i dX i 



J P(X t \D,M)X t 



dX, 
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Central Interval. The (1 — 2a) central interval is defined such that a fraction 
a of the probability is contained on either side of the interval 

r -Slower Mmax 

a= P(\i\D,M)d\i= / P(\ i \D,M)d\ i , 

^ -^min ** -^uppcr 

where the desired interval is [Ai owcr , A U p per ]. The minimum and maximum 
allowed values of the parameter are A m i n and A max , respectively. 

Smallest Interval. The smallest set of A containing a of the probability. The 
set satisfies P(\\D,M) > P m i n , where P min is defined as 

rP{\*\D,M) 

/ p(P(X\D,M))dP(X\D,M) = a, 

where A* maximizes the full posterior pdf. The expression p(P(X\D, M)) is 
the probability density of the posterior pdf. 

Correlation. The correlation coefficient between two parameters, Aj, Xj, can 
easily be evaluated 

< X i X j > - < Aj >< A,- > 

Pa = " — 

rmsi rmsj 

f f P(Xi,Xj\D, M)X i X j dXi dXj - J P(\\D, M)XidXi f P(X j \D,M)X j dX j 

rmsi rmsj 

As should be clear, the posterior pdf contains all the information one 
could wish for, and it is just a matter of defining which quantities are of 
interest. A standard output of an analysis could be: 

• the modes of the parameters from the full posterior pdf 

• the mean and mode of each parameter; 

• the rms of each parameter; 

• the correlation coefficient between each pair of parameters; 

• the central 68% and 90% probability intervals and narrowest probabil- 
ity intervals for each parameter. 
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2.2. Using the Posterior for Uncertainty Propagation 

Having full access to the posterior pdf allows for the evaluation of any 
function of the parameters, and the evaluation of the probability distribution 
of that function. In contrast to standard techniques used for propagation of 
uncertainties, there is no need here for any approximations. As an example, 

— * 

consider a function y of the variable x which depends on the parameters A, 

— # 

such as a linear function, y = mx + b, with A = (m,b). To evaluate the 
probability density distribution for y at any x, we have 

P{y{x)) = J P(m,b)S(y = mx + b) almdb . (17) 

This can then be used to find any of the quantities of interest regarding the 
function y at any value of x, such as for example the central 68% interval: 

0.16= / P(y(x))dy = / P(y(x))dy 

J ^min " 2/upper 

with the central 68% interval given by (yiower, 2/ upP cr)- 



3. Model Validity 

3.1. Definition of p-value 

Model testing in a strictly Bayesian approach is problematic since there 
is often no way to define all possible models, and the results depend critically 
on the choice of priors. However, having a number representing how well the 
model fits the available data is important. In the BAT toolkit, a p-value is 
defined which gives a goodness-of-fit criterion based on the likelihood of the 
data in the model under consideration using the parameters defined at the 
mode of the posterior. We define the following quantities: 



/*(£) = P(x\\*,i?,M) (18) 
f D = P(D\\*,i?,M) = r(x = D), (19) 

where (A*, V*) is the set of parameter values at the mode of the full pdf. We 
then define the following quantity to evaluate the validity of a model, M: 

_ SfjzxfD P(. x )dx 
P ~ Jf*(x)dx • 1 Uj 
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The quantity p is the tail-area probability to have found a result with 
/* < f D i assuming that the model M and the parameters (A*, P*) are valicjl. 
This quantity is analogous to the well-known x 2 probability Note that before 
the experiment is performed, p can be viewed as a random variable. The 
definition of p can be rewritten as: 

p(d)= / p(r(x))dr(x) (21) 

Jo 

so that p is just the value of the cumulative distribution function for P(f*(x)) 
evaluated at x — D. This implies that, assuming the modeling is valid, p will 
have a flat probability distribution between [0, 1], and is therefore well-suited 
as a goodness-of-fit quantity. It is the probability that the likelihood could 
have been lower than the one observed in the data, assuming the model and 
parameter values are correct. If the model does not give a good representation 
of the data, then p will be a small number. This definition is of course only 
valid if the parameters are not adjusted with the data at hand. I.e., the 
argument given is strictly only valid for the model with the current values 
of the parameters compared to a future data set. Since the existing data is 
used to modify the parameter values, the extracted p-value will be biased to 
higher values. The amount of bias will depend on many aspects, including 
the number of data, the number of parameters, and the priors. The user will 
need to keep this in mind when using the p-value. However, we feel that 
the p- value is nevertheless a useful quantity for evaluating how well a model 
represents data. 

3.2. Model Comparisons and Discovery Criteria 

Model comparison and deciding on a discovery are related topics. Model 
comparison can be performed either via the p-value described above, or via a 
probability calculated from the posterior pdf. There are different possibilities 
for formulating the discovery process. One possibility is to perform hypoth- 
esis testing, and to declare a discovery if the probability of the hypothesis 
'the standard physics model contains all processes' is small. Another is to 
calculate the p- value for the 'standard model' to explain the observations, 
and compare this to the p- value of a model including new physics. 



2 A different definition used in Bayesian data anafysis does not fix the parameter value 
to the mode, but rather weights them with the posterior pdf [2| . 
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Comparison via p-values. Model comparison is easily performed by compar- 
ing the p-values for different models. Large p-values imply that the model is 
a good representation of the data. If there are several models available, then 
the one with the largest p-value gave the best representation, although all 
models with reasonable p-values (e.g., larger than 0.1) should be considered 
to give good fits. Note that implementing Occam's razor is straightforward 
- one would choose the simplest model which gives a good p-value. If the 
p-value for the standard physics model is very small and the p-value for a 
model containing the new physics is reasonably large, then a discovery could 
be claimed. 

Comparison via absolute probability. Models can also be compared by calcu- 
lating the absolute probability of each model: 

P(M\D) = J P(M,X,u\D)dXdu (22) 
/ P(D\X, u, M)P (X, V, M) dXdv 



Em / P(D\\, V, M)P {\, P, M) dXdv 
J P(D\X, z7, M)P (A, u\M) dXdi7P (M) 
J2m ] P(D\X, v, M)P (X, V) dXdv P (M) 



(23) 
(24) 



This approach has the advantage of being a true probability (degree- of-belief) , 
but it requires the specification of a full set of models giving 

^P (M) = 1 (25) 

M 

and is often very sensitive to the definitions of the priors. It has also a 
technical disadvantage in that the denominator must be calculated (it is 
not necessary for the other quantities discussed thus far). An example of 
this approach is given in [jjj], where the search for neutrinoless double beta- 
decay is described. There, only two possibilities are allowed - there are only 
background processes, or, there are background processes and additionally a 
neutrinoless double beta-decay signal. It is possible to formulate the priors 
and likelihoods for each of these cases, and a discovery criterion can be given 
as P(background only|P) < -^threshold, where P t hreshoid is a small value. 
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4. Setting limits on Parameters 



Setting limits on parameters is normally straightforward, and just requires 
integration of the posterior pdf. For example, setting a 90% upper limit on 
parameter Aj would require solving 

r -^uppcr 

0.9 = / P(Xi\D,M)dXi 



" ^min 

for A upper . However, there are situations in which the posterior cannot be 
evaluated because there is great uncertainty on the form of the prior, and 
the posterior pdf depends strongly on the form chosen. This type of situation 
can arise when probability distributions for parameters of a new, unproven, 
model are desired, and the goal is to rule out certain parameter ranges for 
the model. One possibility in this case is to use the ratio of likelihoods as 
proposed by d'Agostini 

R(X\D,M) = P ^ M \ 
P{D\S) 

where S represents the standard model. One can then choose by conven- 
tion to say that parameter values yielding R < R cut are ruled-out (better, 
disfavored). A possible choice for R cut is 0.1. This does not mean that the 
parameter values giving small R are ruled out with a certain probability - 
there is not enough information to say this - but the degree- of -belief m these 
values is certainly low. 
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5. Markov Chain Monte Carlo 



The posterior pdf, Eq. (jTOl) . is determined using a Markov Chain Monte 
Carlo {MCMC) [|. We give a brief review of Markov Chains, and the par- 
ticular algorithm used to realize the chain. 

5.1. Markov Chains 



Markov Chain 




MX,?)~ P(i,7\D,M) 



Random numbers 



Figure 2: Scheme for producing a Markov Chain with stationary distribution 
P(X, V\D,M). Independent, identically distributed random numbers are used to create 
random numbers, or vectors of numbers, according to any desired distribution. 

Markov Chains are sequences of random numbers (or vectors of numbers), 
X t , which have a well-defined limiting distribution, ir(x). The fundamental 
property of a Markov Chain is that the probability distribution for the next 
element in the sequence, X t+1 , depends only on the current state, and not 
on any previous history. A Markov Chain is completely defined by the one- 
step probability transition matrix, P(X t+ i = y\X t = x). Under certain 
conditions (recurrence, irreducibility, aperiodicity), it can be proven that the 
chain is ergodic; i.e., that the limiting probability to be in a given state does 
not depend on the starting point of the chain. The probability to be in a 
given state is then stationary. An MCMC is a method producing an ergodic 
Markov Chain which stationary distribution is the distribution of interest. 
In our case, we produce a Markov Chain where the stationary probability 
density is the desired posterior pdf, i.e., n = P(X, u\D, M). A graphical 
description is given in Fig. [2j 

The original and most popular algorithm which achieves this is the Metropo- 
lis algorithm j^]. 
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5.2. Metropolis Algorithm 

The algorithm works as follows: 

1. Given that the system is in state X t — x, a new proposed state, y, is 
generated according to a symmetric proposal function, g(y,x). In our 
application, a state is a particular set of parameter values. 

2. The quantity 




is then calculated, and compared to a random number U, generated flat 
between [0, 1]. If U < r, then we set X t+ i = y, else, we take X t+ x = 

It is possible to show that, given a reasonable proposal function g, this algo- 
rithm satisfies the conditions of an MCMC, and that the limiting distribution 
is 7r(x). This then allows for the production of states distributed according to 
the desired distribution. In particular, this allows for the generation of ran- 
domly distributed system states according to complicated probability density 
functions which have no analytic form. All that is required is that ir(x) can 
somehow be calculated. 



6. Implementation 

The requirements for a technical implementation of the Bayesian infer- 
ence described above are (i) a flexible framework which allows to formulate 
the models developed, and (ii) a reliable and fast code for numerical op- 
erations such as optimization, marginalization and integration. The BAT 
software framework was designed to fulfill both requirements. Its technical 
implementation is described in the following. 

6.1. Framework 

The BAT software framework is C++ based code which comes in the 
form of a library. It offers classes which are used to phrase the problems 
encountered and to perform numerical operations. The framework is inter- 
faced to software packages such as ROOT 0, Minuit j?]], or the CUBA 
library (§]. It offers several output formats, e.g., plain ASCII files, ROOT 
trees and graphical displays. The BAT library can be linked against in a 
standalone program or be loaded into ROOT. 
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The formulation of models and their corresponding terms, as, e.g., in 
Eq. fll2p . is done in the form of methods which belong to the above men- 
tioned classes. These are defined by the user. For simple applications such 
as fitting a histogram or a function with Poissonian or Gaussian uncertain- 
ties, the BAT software provides predefined classes which can be used with 
minimal programming effort. 

A detailed description of the class structure and the methods is given 
in 1. 



6.2. Numerical Implementation 

The numerical operations which need to be performed during the analysis 
are optimization, marginalization and integration. Several algorithms for 
each operation are either implemented in the current version of the code or 
planned for later versions. Emphasis is placed on the MCMC. 

MCMC. The posterior probability density of Eq. ffT5l) is sampled using an 
MCMC. A pre-run of the MCMC is performed before the sampling and analy- 
sis run in order to ensure convergence and to find reasonable run parameters. 

For the pre-run, several chains are run in parallel with random starting 
points in the allowed range. The steps in parameter space are done consec- 
utively for each parameter and chain. The proposal function for new steps 
is chosen flat by default with a range initially set to the width of the al- 
lowed range of the parameter. An iteration is defined as the set of steps 
from an update of the first parameter of the first chain to the last param- 
eter of the last chain. The efficiency for accepting or rejecting new points 
is evaluated separately for each parameter and chain over several iterations 
(default: 1, 000). The proposal function ranges are updated every 1, 000 iter- 
ations until an efficiency between 10% and 50% is found for each par ameter. 
The convergence of Markov chains is monitored via the r- value 10j, which 



is defined as r = yV/W, where V is the estimated variance of the target 
distribution and W is the mean of variances of all chains. The quantities are 
defined as follows: 

W = 



j j m n 



j=l i=l 
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m 

n m — 1 ^— ' 

3=1 

where m is the number of chains run simultaneously (default: 5), n is the 
number of elements in one chain for which the variance is estimated (default: 
1, 000 iterations), and x is a quantity of interest. 

Once convergence is reached both V and W should be the same, i.e., 
r ~ 1. The convergence criterion is set by default to (r — 1) < 0.1 has to 
be fulfilled for all parameters simultaneously, as well as by the posterior pdf. 
The pre-run is performed for a minimum number of iterations (default: 500) 
and is continued either until the chains converged and the efficiency of each 
parameter is found in the required range, or until a maximum number of 
iterations is reached (default: 1, 000, 000). No output is produced during the 
pre-run. 

The sampling and analysis run is performed for a defined number of iter- 
ations (default: 100, 000) with run parameters found in the pre-run. Several 
operations are performed during the sampling: the global mode is compared 
to the current point, histograms are filled for the marginalized distributions 
(Eq. ( fl6|) ). and, if applicable, the probability distribution for the function 
being fitted to the data is evaluated. A user interface allows to perform fur- 
ther operations during the sampling such as evaluating arbitrary functions 
of the parameters. 

Optimization. The BAT package extracts the mode as a standard output. In 
case this is the only information from the posterior pdf of interest, an interface 
to the ROOT version of Minuit can be used. All necessary information, 
such as the ranges of the parameters and the function itself, are automatically 
transfered to Minuit. The Minuit run parameters can be adjusted by the 
user. 

Integration. For model comparisons via the posterior probability of the mod- 
els the denominator of Eq. (122]) has to be evaluated. If no analytical ex- 
pression is provided by the user, a numerical integration over the posterior 
probability is performed. In the current version the implemented algorithms 
are sampled mean with and without importance sampling. Interfaces to tools 
such as the CUBA library are available. 
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Evaluation of the validity of a model. The validity of a model can be evalu- 
ated with the prescription given in section [3J In cases where the likelihood 
is the product of a finite number of simple expressions, BAT can generate 
ensemble data sets with the parameters which maximize the posterior pdf. 
Subsequently, the likelihood is calculated for those data sets and the p-value 
is estimated. In case the likelihood is more complicated, an external genera- 
tor has to be used. The ensemble data sets can be read into BAT and used 
for the analysis. 

For the cases where the likelihood is a product of Gaussian or Poissonian 
distributions, a fast evaluation of the p-value is possible in BAT. 

6.3. Output 

BAT provides the following output by default: 

• a plain ASCII file which summarizes the models and their parameters. 
It displays the results of the analysis such as the global mode, the mean 
and modes of the marginalized distributions, or the uncertainties on the 
parameters; 

• ROOT trees which store the summary information. The information of 
several runs with, e.g., different data sets can be stored and compared; 

• ROOT trees containing the Markov Chains. All points of the chains 
can be stored together with an index and the posterior pdf at that 
point. This allows the chains to be analyzed offline; 

• histograms of the marginalized distributions (ID) and histograms con- 
taining the correlation between two parameters (2D). These are stored 
in the form of ROOT histograms. 

7. Examples 

As examples of the use of BAT, we consider the analysis of two similar 
data sets generated using the same functional form. In the first case, the 
data fluctuations are taken to be Gaussian while in the second case the 
fluctuations are generated using a Poisson distribution. The same models are 
used to fit to the data in each case, but the conclusions are quite different, 
driven in large part by the different uncertainties assigned to the data points. 
We first discuss the Gaussian fitting example in some length, describing the 
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fits and the standard BAT output. The wealth of information which can 
be extracted from the Markov Chain is apparent already in these simple 
examples. A comparison of the use of the p-value proposed in this paper 
with the posterior probability of the models is also discussed. 

7.1. Example: Gaussian Uncertainties 

The data shown in Fig. [3] were fitted by several models. The data point 
Hi at a given Xi was generated by sampling from a Gaussian distribution with 
mean f(xi) and fixed standard deviation of s = 4 units, where 

f{xt) =A + Bx i + Cx 2 i + — ^ =e " l£ ^- , (26) 

and (A = 0, B = 0.5, C = 0.02, D = 15, a = 0.5, /i = 5.0). The data are 
plotted in the usual way, with the bar on each data point indicating the size 
of the data uncertainty (in this case the value used to generate the data). 




Figure 3: The data set used in the example fits described in the text. The function from 
which the data points were generated is shown as a dashed blue line. 
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The data were fitted using the following models: 



1. second order polynomial; 

2. constant plus Gaussian; 

3. linear plus Gaussian; 

4. quadratic plus Gaussian. 

The likelihood function was taken to be 



where f(xi) is evaluated using the current values of the parameters A in the 
model. Table [I] summarizes the parameters available in each model and the 
range over which they are allowed to vary. In all models, flat priors were 
assumed for all parameters. An example with non-flat priors is discussed in 
a later section. 

The fits themselves are shown in Fig. HI while the results for the mode, 
the mean and the central 68% intervals on the parameters are given in Tab. [U 
The solid lines in Fig. H] are the functional forms evaluated with the param- 
eters set to the global mode values. For the calculation of the uncertainty 
band, the output of the Markov Chain was used to produce a distribution of 
y-values from the model at different x-values. The central 68% probability 
interval of these y-values then defines the uncertainty band. We note the 
following: 

• the fit functions all give a reasonable description of the data within the 
constraints allowed by the functional forms; 

• the values of the functions evaluated at the parameter values from the 
global mode can lie outside the central probability interval (e.g., the 
peak region in model III). 

The possibility of having multiple modes in the probability distributions 
as well as a different definition of the uncertainty interval will be discussed 
later. 

7.1.1. The Posterior pdf 

The BAT program outputs a file containing sets of parameter values 
distributed according to the full posterior pdf. In addition, the BAT program 
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Table 1: Summary of the models fitted to the data, along with the ranges allowed for the parameters. The value of the 
parameter from the global mode, the mean (or 95% probability limit for a parameter if it is at the edge of the allowed range) 
and the central 68% probability interval from the marginalized distribution are given. The last column gives the p-value of 
the fit. 





Figure 4: Graphical display of the model fits to the data. The solid lines give the models 
with the parameters set to the global mode values. The band is calculated as explained 
in the text. 
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produces graphical output of the marginalized posterior pdjs of all parameters 
and parameter pairs. The posterior pdjs of the parameters for model I are 
shown in Fig. [5] as an example of these marginalized distributions. For the 
one dimensional distributions, the dashed line marks the median. The mean 
and global mode are marked by the diamond and the triangle, respectively. 
The band gives the central 68% probability interval. If the marginalized 
mode for a parameter is at one of the limits, the 95% probability lowest or 
highest interval is shown instead. The median is given at the top of each plot 
along with the range allowed from the central 68% probability interval. In 
the case where a limit is reported, the upper or lower 95% probability value 
is quoted. For the two dimensional distribution the pdf is represented by a 
color code and the open circle marks the global mode. 

7.1.2. Multiple Modes 

It is often the case that there are several modes in the posterior pdf. 
As an example, we look in more detail at the posterior pdjs for model III. 
Figure [6] shows the marginalized pdjs for the amplitude, -Din, and the mean 
of the Gaussian peak, /j,m, and the two-dimensional probability contours for 
the amplitude and mean, and the mean and width of the Gaussian, am- 
As becomes clear from these plots, it is possible for a large fraction of the 
probability distribution to be far from the parameter value of the global 
mode. The global mode can in fact lie outside the central 68% probability 
interval. In the fit of model III to our simulated data, two regions for the 
parameter /xni have significant probability. In one case, a low lying peak 
is located at /im ~ 5, while in the second case there is a peak located at 
//in ~ 17 (see Fig. [6] upper right). The peak at fim ~ 5 corresponds to a 
narrow Gaussian with small amplitude as can be seen in the lower left and 
lower right plot of Fig. O The peak at fim ~ 17 is broad and has a larger 
amplitude. The 2D-projections of the posterior pdf indicate that there are 
several regions with significant probability. 

With BAT the level of information available allows the user to see effects 
such as the multiple modes clearly, and to then react accordingly. In the 
example discussed here, the user may want to redefine the prior ruling out 
the possibility of the wide Gaussian and perform the fit again. 

7.1.3. Definition of Uncertainty Interval 

The central interval is not always the appropriate choice for the definition 
of the uncertainty interval. Another option is to take the narrowest set of 
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Figure 5: Examples of standard output distributions from the BAT program. The 
marginalized pdjs for each of the parameters from model I are shown as well as the prob- 
ability contours from P(Bi, C\\D). Note the log-scale in the bottom right plot. 
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intervals containing 68% of the posterior pdf. Choosing this definition can 
produce discontinuous regions in the parameter space as shown in Fig. The 
uncertainty band on the value of the fit function at a given value of x can 
also be calculated with this approach. As an example, Fig. [8] (left) shows the 
distribution of possible y-values for x = 5.0. The shaded area is the smallest 
set of intervals containing 68% probability. The resulting uncertainty band 
is shown in Fig. [8] (right). 
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Figure 7: The marginalized pdjs for parameters Dm and /ini from model III using the 
smallest set of intervals containing 68% of the probability. 



7.1.4- Model Comparison 

The model comparison can be carried out via the p- values given in Tab. [TJ 
The corresponding p- values for models I-IV are 0.154, 0.025, 0.479 and 0.667, 
respectively. From the p-values, one concludes that models I, III and IV 
are all in good agreement with the data, whereas model II is somewhat 
disfavored. This is consistent with the conclusions one would draw from 
a visual inspection of the plots in Fig. HI For this example, the p-value 
provides a useful goodness- of-fit number summarizing the ability of a model 
to represent the data. 

It is also possible to calculate directly a posterior probability for each 
model to be correct. In this case, the prior probabilities for each model were 
taken to be the same. The posterior probabilities are then calculated accord- 
ing to Eq. ff22l . The results for the four models were 0.88, 7.6- 10~ 6 , 0.12 and 
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Figure 8: Results from fits of model III to the data. Left: An example for the distribution 
of y- values for a fixed value of a; = 5.0. The shaded area is the smallest set of intervals 
containing 68% of the probability. Right: The uncertainty band calculated using this 
definition. 

8.2 • 10~ 3 , respectively. These values are very sensitive to the range allowed 
for the parameter values in the model, and models with more parameters 
are automatically disfavored. Making use of this probability requires a very 
considered choice of the priors used for the models as well as the priors on 
the individual parameters within a model. 

7.2. Example: Poissonian Uncertainties 

In this example, the data were generated assuming the expression in 
Eq. ([52]) is the expectation value for a Poisson distribution in bin i. I.e., 
the number of events in the zth bin, rij, was simulated using a Poisson dis- 
tributed number with mean f(xi). The same models were fitted to the data, 
and the parameter ranges were those specified in Tab. [TJ The data which 
were fitted are shown in Fig. [9j together with the models evaluated using the 
parameter values from the global mode of the fits. 
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Figure 9: The data set as well as the fitted functions for the example with Poissonian 
fluctuations. Upper left, model I; upper right, model II; lower left, model III; lower right, 
model IV. The shaded band shows the uncertainty on the fit functions from the central 
68% interval. 
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This example represents a typical histogram fitting application. The data 
in bins represent the result of a counting experiment, and the likelihood 
function is 

e -/(xi) f( Xi yi 



P(D\X,M) = H 



where f(xi) is the prediction for the mean for bin i for a given model, M. 

In this case, we find that the models are easily distinguished. There is a 
much stronger discrimination between the models since the uncertainties on 
the data are considerably smaller. The distribution of /* (see Eq. (1181) ) for 
10 6 simulated experiments for each of the models is shown in Fig. dOj along 
with the value of f D . The resulting p-values are 0.3 • 1CT 3 , < 3 • 1CT 6 , 0.563 
and 0.551 for models I-IV, respectively. The upper limit on the p-value for 
the second model results from finding simulated experiments with such a 
low likelihood, and corresponds to a 95% upper limit. The models I and II 
can be clearly distinguished from the models III and IV. It is not possible to 
distinguish the latter two models based on the p-value, and in practice the 
model with fewer parameters would be preferred. 

7.2.1. N on- flat priors 

As an example of the influence of the prior used for an individual pa- 
rameter in the extraction of the posterior pdf, a different prior was used for 
the parameter a in models II, III and IV. This restriction could come from a 
known detector resolution, e.g., which would limit the observed width of any 
bumps in the data. Rather than using a flat prior, the following form was 
chosen: 

Poio-) 



0.3V2vr 

No significant differences in the marginalized distributions and the p-values 
were found for models III and IV, since the data already constrain these 
quite well. However, the solution with a large value of an in model II, 
discussed above, was now suppressed, and f D for the fit was reduced. This 
clearly illustrates the need to put the maximum amount of knowledge into 
the fitting procedure in order to get out the best possible results. 
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Figure 10: Distribution of /* for models I-IV (histogram). The value of f D is shown 
the red vertical line. The p- values are calculated based on Eq. (fT8|) . 
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8. Summary 



We have described the development of a general purpose analysis tool 
based on a Bayesian learning algorithm, the BAT package. BAT is based 
on the Markov Chain Monte Carlo technique, and provides all the standard 
quantities such as best fit parameters, goodness-of-fit, upper limits, etc. In 
addition, BAT provides a sampling of the parameters of the model under 
study according to the full posterior probability density function. This allows 
for detailed investigations of correlations between parameters, and the eval- 
uation of the probability density for any function of the parameters, without 
approximations. We believe this extra functionality will be very useful in 
data analysis. 

The BAT package is coded in C++ and is available under 



http://mppmu.mpg.de/bat/. A manual describing the use of BAT is also 



available at that location. 
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